🚨 EP: fix EP router contract for many models + honor FP8 scale format by IlyasMoutawwakil · Pull Request #46818 · huggingface/transformers

IlyasMoutawwakil · 2026-06-22T15:24:30Z

What does this PR do?

Fixes # (issue)

Code Agent Policy

The Transformers repo is currently being overwhelmed by a large number of PRs and issue comments written by
code agents. We are currently bottlenecked by our ability to review and respond to them. As a result,
we ask that new users do not submit pure code agent PRs at this time.
You may use code agents in drafting or to help you diagnose issues. We'd also ask autonomous "OpenClaw"-like agents
not to open any PRs or issues for the moment.

PRs that appear to be fully agent-written will probably be closed without review, and we may block users who do this
repeatedly or maliciously.

This is a rapidly-evolving situation that's causing significant shockwaves in the open-source community. As a result,
this policy is likely to be updated regularly in the near future. For more information, please read CONTRIBUTING.md.

I confirm that this is not a pure code agent PR.

Before submitting

This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
Did you read the contributor guideline and the
Pull Request checks?
Was this discussed/approved via a Github issue or the forum? Please add a link
to it if that's the case.
Did you make sure to update the documentation with your changes according to the guidelines?
Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

HuggingFaceDocBuilderDev · 2026-06-22T15:37:24Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

3outeille · 2026-06-23T08:01:12Z

        return Fp8Quantize(self.hf_quantizer)
-
-
-class Fp8DecodeScale(ConversionOps):


Any idea why this part was dropped?

Because I added support for ue8m0 scales in finegrained-fp8 v3. This was needed for minimax m3 with v2, but not anymore, and it also wastes memory.

ue8m0 scales are a bit messy: some checkpoints store them in the correct torch dtype, some in uint8, and some even in fp32 for no special reason 😭. I'm trying to tighten the contract and honor the config at all times, because supporting every on-disk variation would be more complicated.

Okay! Just to be sure: if we remove it now, it would not break existing checkpoints that are in mxfp8 format, right?

No, they will work fine, even better in fact: I just noticed that the fp32 scales were bypassing the optimized mxfp8 path in https://github.com/huggingface/kernels-community/blob/aeb8ef0e09a132a6583c0a4c8b1096292922b54a/finegrained-fp8/torch-ext/finegrained_fp8/utils.py#L64. I also ran the minimax m3 integration tests on the B200.

@ArthurZucker

Yep, sounds good. Just require the kernel version for that path, so it errors out properly if that version is not installed.

we do pin the v3 in our lazy loading
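
For readers less familiar with the format discussed in this thread: UE8M0 is the unsigned, exponent-only 8-bit scale encoding used by MX-style block formats, where a stored byte `e` stands for `2**(e - 127)` and `0xFF` is reserved for NaN. The pure-Python sketch below shows why a power-of-two scale kept in an fp32 container carries no more information than its one-byte form. The helper names are illustrative only and are not part of transformers or the finegrained-fp8 kernels.

```python
import math
import struct

E8M0_BIAS = 127
E8M0_NAN = 0xFF  # reserved encoding, not a valid scale


def encode_ue8m0(scale: float) -> int:
    """Return the one-byte UE8M0 code for a positive power-of-two scale."""
    # frexp gives scale == mantissa * 2**exponent with mantissa in [0.5, 1),
    # so an exact power of two always has mantissa == 0.5.
    mantissa, exponent = math.frexp(scale)
    if scale <= 0 or mantissa != 0.5:
        raise ValueError(f"{scale!r} is not a positive power of two")
    code = exponent - 1 + E8M0_BIAS
    if not 0 <= code < E8M0_NAN:
        raise ValueError(f"{scale!r} is outside the UE8M0 range")
    return code


def decode_ue8m0(code: int) -> float:
    """Map a UE8M0 byte back to the scale it represents."""
    if code == E8M0_NAN:
        return math.nan
    return math.ldexp(1.0, code - E8M0_BIAS)


scales = [2.0**-7, 1.0, 2.0**5]
codes = bytes(encode_ue8m0(s) for s in scales)              # 1 byte per scale
fp32_container = struct.pack(f"<{len(scales)}f", *scales)   # 4 bytes per scale
decoded = [decode_ue8m0(c) for c in codes]                  # round trip is exact
```

In this toy case the fp32 container is four times larger than the byte form while decoding to the same values, which matches the memory argument made above.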

3outeille · 2026-06-23T08:24:43Z

-            intermediate_size = config.moe_intermediate_size * config.n_shared_experts
-            self.shared_experts = DeepseekOcr2TextMLP(config=config, intermediate_size=intermediate_size)
+        self.n_routed_experts = config.n_routed_experts
+        self.num_experts = config.n_routed_experts


Redundant variables?

Yeah, I guess we can drop n_routed_experts; removing it.

Hmm, so it seems to cascade into many models.

3outeille · 2026-06-23T08:34:14Z

maybe better to add _skip_if_ep_not_supported here instead of within test_ep_*?

tell me if this works for you 0288a11

vasqu

Only checked the modeling parts re modular and the models themselves. It is technically slightly breaking because we move parts around between modules, so let's add 🚨

Generally aligned with this, just a bit unsure about the minimax m3 change: are we keeping everything as is without dequantizing, and then only converting after all conversions? Not sure I can follow there 100%.

vasqu

Just some quick comments, ping me when it's ready for another review. Seems like some comments were resolved but not addressed?

IlyasMoutawwakil · 2026-06-24T08:51:33Z

+            if not self._ep_plan:
+                raise ValueError(
+                    f"Expert parallelism was requested (`enable_expert_parallel=True`), but "
+                    f"`{self.__class__.__name__}` does not define an expert-parallel plan. Add a "
+                    f"`base_model_ep_plan` to its config, or disable expert parallelism."
+                )


loud failure on missing ep plan
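
A minimal, self-contained sketch of the behaviour this check aims for, assuming a model class exposes its plan as `_ep_plan`. `ToyModel`, `ToyMoeModel`, `distribute`, and the plan value are hypothetical stand-ins and not the actual transformers loading path.

```python
class ToyModel:
    # Hypothetical: models that support expert parallelism would define a dict here.
    _ep_plan = None

    def distribute(self, enable_expert_parallel: bool = False) -> str:
        if enable_expert_parallel:
            # Fail loudly instead of silently falling back to another strategy.
            if not self._ep_plan:
                raise ValueError(
                    "Expert parallelism was requested (`enable_expert_parallel=True`), but "
                    f"`{self.__class__.__name__}` does not define an expert-parallel plan."
                )
            return "expert-parallel"
        return "single-device"


class ToyMoeModel(ToyModel):
    # Illustrative plan entry; the value is a placeholder, not a real sharding style.
    _ep_plan = {"layers.*.mlp.experts": "illustrative_style"}


mode = ToyMoeModel().distribute(enable_expert_parallel=True)
```

With this shape, a model without a plan still works when expert parallelism is off, and only raises when the caller explicitly asks for it.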

IlyasMoutawwakil · 2026-06-24T08:58:25Z

    index_head_dim: int = 128
    index_n_heads: int = 64
    mlp_bias: bool = False
-    num_experts: int = 256


this was confusing

IlyasMoutawwakil · 2026-06-24T10:14:36Z

+    def _process_model_after_weight_loading(self, model, **kwargs):
+        # dsv4-flash-base stores its (power-of-two) ue8m0 scales in a float32 container under
+        # `.scale`; those renamed keys keep the on-disk float32 dtype, so cast them to the UE8M0
+        # dtype the kernels expect (exact, since the values are powers of two). Checkpoints that
+        # already ship the native float8 E8M0 dtype (e.g. dsv4-flash) are left untouched.
+        if self.quantization_config.scale_fmt == "ue8m0":
+            from ..integrations.finegrained_fp8 import _get_ue8m0_dtype
+
+            ue8m0 = _get_ue8m0_dtype()
+            float32_scales = [
+                name
+                for name, param in model.named_parameters()
+                if name.endswith("_scale_inv") and param.dtype == torch.float32
+            ]
+            for name in float32_scales:
+                module_name, _, attr = name.rpartition(".")
+                module = model.get_submodule(module_name)
+                scale = getattr(module, attr)
+                setattr(module, attr, torch.nn.Parameter(scale.data.to(ue8m0), requires_grad=False))
+        return model


either like this or by hooking a quantization op to the scale rename op

okay the second option didn't work

I kinda prefer with a fp8DecodeScale

why does it have to be post proc?

Fp8DecodeScale did the opposite: it targeted mxfp8, where it converted true ue8m0 scales to fp32.
This is for dsv4-flash-base, where we need the reverse, i.e. converting fp32 to ue8m0 to honor the config's scale_fmt (because for some reason they stored their ue8m0 scales in fp32 😭). That way we avoid a cast, with a new memory allocation, at the entry of each kernel.

why does it have to be post proc?

because the rename catches the dsv4 flash base scales first
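
The traversal used by the post-load hook in this thread can be tried without torch. The sketch below mimics it on a toy module tree: `Param`, `named_parameters`, `get_submodule`, and `cast_fp32_scales` are re-implemented here purely for illustration, and the strings `"float32"` and `"ue8m0"` stand in for real dtypes.

```python
from dataclasses import dataclass
from types import SimpleNamespace


@dataclass
class Param:
    dtype: str  # stand-in for a torch dtype


def named_parameters(module, prefix=""):
    """Yield (dotted_name, Param) pairs from a tree of SimpleNamespace 'modules'."""
    for attr, value in vars(module).items():
        if isinstance(value, Param):
            yield prefix + attr, value
        else:
            yield from named_parameters(value, prefix + attr + ".")


def get_submodule(root, path):
    """Resolve a dotted path; an empty path returns the root itself."""
    for part in filter(None, path.split(".")):
        root = getattr(root, part)
    return root


def cast_fp32_scales(model, target="ue8m0"):
    # Collect names first so the tree is not mutated while it is being walked.
    names = [
        name
        for name, param in named_parameters(model)
        if name.endswith("_scale_inv") and param.dtype == "float32"
    ]
    for name in names:
        module_name, _, attr = name.rpartition(".")
        setattr(get_submodule(model, module_name), attr, Param(dtype=target))
    return names


model = SimpleNamespace(
    mlp=SimpleNamespace(
        gate_up_proj=Param("float8_e4m3fn"),
        gate_up_proj_scale_inv=Param("float32"),
    ),
    attn=SimpleNamespace(q_proj_scale_inv=Param("ue8m0")),
)
changed = cast_fp32_scales(model)
```

Only the fp32 scale is replaced; the weight and the scale that already has the target dtype are left untouched, which is the selective behaviour the hook's comment describes.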

vasqu

Ok so I think this looks overall good now, just a few smaller comments. Sometimes we add an attribute mapping so that all variations are kind of covered, not sure if we really need it for all models (would just double check)

The quants re minimax m3 were checked re dequant and quant so I think we are good with the changes but would like to hear @ArthurZucker's opinion on those related changes

vasqu · 2026-06-24T12:22:27Z

Let's also update the PR description please so we summarize the changes a bit

FP8 scale changes
EP Plans for all moes
- Refactor along all models to follow the same format as router/gate -> experts (-> shared experts)
- Additional miscellaneous stuff like erroring out on moes that should have the plan
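
To make the "router/gate -> experts (-> shared experts)" layout concrete, here is a dependency-free sketch of one scalar token passing through such a block. Everything in it (the function names, the scalar "experts", and the way `norm_topk_prob` is applied) is a simplified illustration under assumed semantics, not code from any model touched by this PR.

```python
import math


def route(logits, top_k, norm_topk_prob):
    """Softmax over experts, keep the top_k, optionally renormalise their weights."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]  # subtract the max for stability
    total = sum(exps)
    probs = [e / total for e in exps]
    top = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)[:top_k]
    weights = [probs[i] for i in top]
    if norm_topk_prob:
        kept = sum(weights)
        weights = [w / kept for w in weights]
    return top, weights


def moe_block(x, gate, experts, shared_expert=None, top_k=2, norm_topk_prob=True):
    logits = [w * x for w in gate]                                   # 1) router / gate
    indices, weights = route(logits, top_k, norm_topk_prob)
    out = sum(w * experts[i](x) for i, w in zip(indices, weights))   # 2) routed experts
    if shared_expert is not None:                                    # 3) shared experts
        out += shared_expert(x)
    return indices, weights, out


gate = [0.1, 2.0, -1.0, 1.0]
experts = [lambda x, s=s: s * x for s in (1.0, 2.0, 3.0, 4.0)]
indices, weights, out = moe_block(1.0, gate, experts)
```

Keeping the three stages in the same order across models is what lets a single parallel plan address them by the same names; the sketch only shows the data flow, not any sharding.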

ArthurZucker

LGTM, let's make sure kernel V is enforced

ArthurZucker · 2026-06-24T13:05:17Z

        return Fp8Quantize(self.hf_quantizer)
-
-
-class Fp8DecodeScale(ConversionOps):


Yep, sounds good. Just require the kernel version for that path, so it errors out properly if that version is not installed.

ArthurZucker · 2026-06-24T13:40:49Z

        if self.layer_types is None:
            self.layer_types = ["deepseek_sparse_attention"] * self.num_hidden_layers
+
+        if (num_experts := kwargs.get("num_experts")) is not None:


mmm is this really something we want? let's not warn no?

We had two values, n_routed_experts and num_experts, so this is for BC in case a user explicitly sets it.

Removed the warning, it could indeed trigger unnecessarily
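
A sketch of the backward-compatibility idea discussed here, under the assumption that `n_routed_experts` is the canonical field and `num_experts` is only accepted as a legacy keyword. `ToyConfig` and its default value are hypothetical and far simpler than the real config class; the actual handling in the PR may differ.

```python
class ToyConfig:
    def __init__(self, n_routed_experts: int = 256, **kwargs):
        self.n_routed_experts = n_routed_experts
        # Legacy spelling: honour it silently (no warning) if a caller still passes it.
        if (num_experts := kwargs.pop("num_experts", None)) is not None:
            self.n_routed_experts = num_experts

    @property
    def num_experts(self) -> int:
        # Read-only alias so older call sites that read `num_experts` keep working.
        return self.n_routed_experts


legacy = ToyConfig(num_experts=8)
current = ToyConfig(n_routed_experts=16)
```

Storing a single value and exposing the other name as an alias avoids the two attributes drifting apart, which is the redundancy raised earlier in the review.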

ArthurZucker · 2026-06-24T13:42:58Z

-        "layers.*.mlp.experts.gate_up_proj": "grouped_gemm",
-        "layers.*.mlp.experts.gate_up_proj_scale_inv": "grouped_gemm",
-        "layers.*.mlp.experts.down_proj": "grouped_gemm",
-        "layers.*.mlp.experts.down_proj_scale_inv": "grouped_gemm",
-        "layers.*.mlp.experts": "moe_tp_experts",


ow shit IDK how this slipped in !

ArthurZucker · 2026-06-24T13:43:26Z

+        del self.topk_method
+        self.norm_topk_prob = config.norm_topk_prob

    def forward(self, hidden_states):


we can probably push standards, but it's fine

(meaning other models potentially do exactly this as well?)

Not sure what you mean here? It's the same as dsv2 (with a slightly different forward --> no norming at the end of the probs)

ArthurZucker · 2026-06-24T13:45:41Z

I kinda prefer with a fp8DecodeScale

ArthurZucker · 2026-06-24T13:45:50Z

why does it have to be post proc?

ArthurZucker · 2026-06-24T13:49:20Z

+        parallelism = "Expert" if expert_parallel else "Tensor"
+        # An EP-capable MoE (@use_experts_implementation) must ship an ep_plan; assert before any
+        # skip so a plan-less model fails even where the parallel test can't run (GPU, old torch).
+        if expert_parallel and self._get_tp_model_class()._can_set_experts_implementation():


Perfect, we want a good default EP plan available.

yeah, we can also make use_experts_impl take care of adding the ep_plan to the config at model init time for example
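
The ordering described in the code comment above (assert first, skip second) can be illustrated with plain Python and `unittest.SkipTest`. In this sketch `check_ep`, the two toy model classes, and the `can_run_parallel` flag are all made up for the example; the point is only that a missing plan surfaces as a failure rather than hiding behind a skip.

```python
import unittest


class MoeWithoutPlan:
    _ep_plan = None  # hypothetical EP-capable model that forgot its plan


class MoeWithPlan:
    _ep_plan = {"layers.*.mlp.experts": "illustrative_style"}  # placeholder value


def check_ep(model_class, can_run_parallel: bool) -> str:
    # 1) The contract check runs everywhere, even on machines that skip below.
    if not getattr(model_class, "_ep_plan", None):
        raise AssertionError(f"{model_class.__name__} is EP-capable but has no ep_plan")
    # 2) Only then bail out if the environment cannot run the parallel part.
    if not can_run_parallel:
        raise unittest.SkipTest("parallel test cannot run in this environment")
    return "ran"


outcome = check_ep(MoeWithPlan, can_run_parallel=True)
```

If the two steps were swapped, a plan-less model would be reported as skipped on any machine that cannot run the parallel test, and the missing plan would go unnoticed there.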

github-actions · 2026-06-24T15:35:04Z

[For maintainers] Suggested jobs to run (before merge)

run-slow: afmoe, cohere2_moe, deepseek_ocr2, deepseek_v2, deepseek_v3, deepseek_v32, dots1, ernie4_5_moe, ernie4_5_vl_moe, exaone_moe, flex_olmo, glm4_moe, glm4_moe_lite, glm4v_moe, glm_moe_dsa, hunyuan_v1_moe

github-actions · 2026-06-24T15:53:33Z

CI Dashboard: View test results in Grafana

vasqu · 2026-06-24T17:23:39Z

Need to check whether we need to update various conversion mappings; so withholding to merge for now

honor the quant config's scale format and refuse

777c272

IlyasMoutawwakil changed the title ~~FP8: Honor the quant config's scale format~~ FP8: Honor the quant config's scale format and fix EP Jun 22, 2026

IlyasMoutawwakil added 5 commits June 22, 2026 08:39

fix fp4 specific

122adb5

strict

98034c9

deeper ep fix

f1e2235

test

83625bc

style

b34872e

IlyasMoutawwakil marked this pull request as ready for review June 22, 2026 19:52

github-actions Bot requested review from ArthurZucker and Rocketknight1 June 22, 2026 19:53

IlyasMoutawwakil changed the title ~~FP8: Honor the quant config's scale format and fix EP~~ EP+FP8: fix EP router contract for many models and honor FP8 scale format Jun 22, 2026

add assertion

0e0550a

3outeille reviewed Jun 23, 2026

IlyasMoutawwakil added 3 commits June 23, 2026 02:13

more ep plans

4a585fe

fold tp+ep checks and ep assert into one helper

0288a11

style

7d5976d

IlyasMoutawwakil requested a review from 3outeille June 23, 2026 09:28

IlyasMoutawwakil and others added 2 commits June 23, 2026 02:52

rasie propper error upon ep request with no ep plan

8ea20bf

Merge branch 'main' into fix-glm-dsa

d288a9e

vasqu reviewed Jun 23, 2026

IlyasMoutawwakil changed the title ~~EP+FP8: fix EP router contract for many models and honor FP8 scale format~~ 🚨 EP: fix EP router contract for many models + honor FP8 scale format Jun 23, 2026

pcuenca mentioned this pull request Jun 23, 2026

[glm-mode-dsa] Indexer uses interleaved rope #46842

IlyasMoutawwakil added 2 commits June 23, 2026 05:13

address anton's comments and make more modular

d7a2fea

fix repo

99f28f1

vasqu reviewed Jun 23, 2026

Comment thread src/transformers/models/lfm2_moe/modular_lfm2_moe.py Outdated

Comment thread src/transformers/models/deepseek_v2/modular_deepseek_v2.py Outdated

IlyasMoutawwakil added 2 commits June 23, 2026 15:28

more modular

b8f8eed

more modular dsv2 topK router

4b04deb

IlyasMoutawwakil commented Jun 24, 2026

IlyasMoutawwakil added 2 commits June 24, 2026 02:00

modular phimoe router

c015bd2

fix

6cf5856

IlyasMoutawwakil commented Jun 24, 2026

reverting phimoe changes

c837936

vasqu approved these changes Jun 24, 2026

IlyasMoutawwakil and others added 5 commits June 24, 2026 05:44

last modular attempt

9cd6feb

correct fix ?

31f2139

add BC variation just in case

5c2b2af

clearer message

918d6c4

post init workaround?

c54aee9

ArthurZucker approved these changes Jun 24, 2026

remove the warning

6de8bff

vasqu mentioned this pull request Jun 24, 2026

feat(model): add bailing v2.6 model #46713

vasqu added 2 commits June 24, 2026 17:25

Merge branch 'main' into fix-glm-dsa

d1dcde0

fix CI

54ab3ee

ci

8d9846b

vasqu added this pull request to the merge queue Jun 24, 2026

vasqu removed this pull request from the merge queue due to a manual request Jun 24, 2026
